Introduction

The rent dataset provided by TidyTuesday contains information about the rental properties in the Bay Area such as rental price, number of beds, number of baths, square footage, county, city, neighbourhood, descriptions, etc. The dataset was compiled by Dr. Kate Pennington in which she turned to Craigslist, a major platform for the United States rental market and Wayback Machine, a web archive, which provided snapshots of Craigslist apartment listings from September 2000 to July 2018 (Pennington, 2018). However, it’s essential to acknowledge certain limitations – The data may not represent the entire rental market, potentially missing extremes such as high-end properties or those communicated through word of mouth, and the temporal and spatial coverage is not continuous, as the Wayback Machine archives websites sporadically (Pennington, 2018).

The question of interest is: How do different factors affect the rental prices of apartments in the Bay Area?. We thus want to study the relationship of different predictor variables including features of the apartment and spatial and temporal effects, on the response variable, price.

Data Cleaning & Summary

Cleaning of rent dataset

The date variable was subdivided into year, month and day, enabling a more detailed exploration of temporal aspects. Additionally, the character variables, county, nhood and city, have been converted into factor columns for improved categorization and analysis. Next, variables such as latitude (lat), longitude (lon), and address, were missing for a large majority of the dataset, and thus could not be used for analysis. Thus, they have been excluded from the analysis together with the post_id variable which is not useful for our research. For identifying outliers in price data, we grouped price by year and county, since these are major factors that might influence price. We employed the interquartile range (IQR) method or Tukey’s Fences method, as defined by John Turkey. This method involves calculating the range between the first (25 percentile) and third (75th percentile) quartiles and outliers are defined as values significantly higher or lower than this range (typically 1.5 times the IQR above the third quartile or below the first quartile). While replacing outliers with the median is a common practice, it skewed our data towards the median, in this instance, leading to potential misrepresentations of the actual price distribution. Thus, we instead removed the outliers of price grouped by year and county.

# Load dataset
rent <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-07-05/rent.csv')

rent_cleaned <- rent %>%
  mutate(date = as.character(date)) %>%
  mutate(date = paste(substr(date, 1, 4), substr(date, 5, 6), substr(date, 7, 8), sep = "-")) %>%
  separate(date, into = c("year", "month", "day"), sep = "-") %>%
  mutate_at(vars("year", "month", "day"), as.numeric) %>%
  mutate_at(vars("county", "nhood", "city"), factor) %>%
  group_by(county, year) %>%
  filter(price <= (quantile(price, 0.75) + 1.5 * IQR(price)), price >= (quantile(price, 0.25) - 1.5 * IQR(price))) %>%
  ungroup() %>%
  distinct() %>%
  select(-c(lat, lon, address, post_id)) %>%
  filter(!is.na(county))

head(rent_cleaned)
## # A tibble: 6 × 14
##    year month   day nhood city  county price  beds baths  sqft room_in_apt title
##   <dbl> <dbl> <dbl> <fct> <fct> <fct>  <dbl> <dbl> <dbl> <dbl>       <dbl> <chr>
## 1  2005     1    11 alam… alam… alame…  1250     2     2    NA           0 "$12…
## 2  2005     1    26 alam… alam… alame…  1295     2    NA    NA           0 "$12…
## 3  2004    10    17 alam… alam… alame…  1100     2    NA    NA           0 "$11…
## 4  2012     6     1 alam… alam… alame…  1425     1    NA   735           0 "$14…
## 5  2004    10    21 alam… alam… alame…   890     1    NA    NA           0 "$89…
## 6  2006     4    11 alam… alam… alame…   825     1    NA    NA           0 "$82…
## # ℹ 2 more variables: descr <chr>, details <chr>

Key Summary Statistics

The data contains 190437 observations after cleaning and the observations date across 2000 to 2018. From the table of summary statistics, we observe that the variables beds and baths do not take a well-distributed range of values, in that the 1st quartile is 1 and the 3rd quartile is 2, and using these variables for data visualisation may not give us a distinct trend as 75% of the data are within a small range of values. After removing outliers for price and sqft, we still observe that the data seems to be right skewed as the max is much higher than Q3. Nonetheless, since we have removed outliers, the data visualisations later on should not be significantly affected by extreme values. From the table of distribution of observations by county, we observe that some counties like Solano, Santa Cruz and Napa seem to be underrepresented while San Francisco seems to be over-represented, so we should avoid averaging values across counties due to the uneven representation of data across counties.

print(paste0("Number of observations after data cleaning: ", nrow(rent_cleaned)))
## [1] "Number of observations after data cleaning: 190437"
print(paste0("Observations date across ", min(rent_cleaned$year), " to ", max(rent_cleaned$year)))
## [1] "Observations date across 2000 to 2018"
# Create a template data frame for price, sqft, beds and baths
summary_stats <- data.frame(
  variable = character(),
  mean = numeric(),
  median = numeric(),
  min = numeric(),
  max = numeric(),
  sd = numeric(),
  q1 = numeric(),
  q3 = numeric(),
  stringsAsFactors = FALSE
)

for (var in c("price", "sqft", "beds", "baths")) {
  summary_stats <- rbind(
    summary_stats,
    c(
      variable = var,
      min = min(rent_cleaned[[var]], na.rm = TRUE),
      q1 = quantile(rent_cleaned[[var]], 0.25, na.rm = TRUE),
      mean = mean(rent_cleaned[[var]], na.rm = TRUE),
      median = median(rent_cleaned[[var]], na.rm = TRUE),
      q3 = quantile(rent_cleaned[[var]], 0.75, na.rm = TRUE),
      max = max(rent_cleaned[[var]], na.rm = TRUE),
      sd = sd(rent_cleaned[[var]], na.rm = TRUE)
    )
  )
}

colnames(summary_stats) <- c("Variable", "Min", "Q1", "Mean", "Median", "Q3", "Max", "SD")
print(summary_stats)
##   Variable Min   Q1             Mean Median   Q3   Max                SD
## 1    price 220 1255 1951.15675000131   1750 2400  7350  943.474030714827
## 2     sqft  80  740 1083.41231820658    980 1279 57067  680.503766240235
## 3     beds   0    1 1.80882608506756      2    2    12  1.01036606179289
## 4    baths   1    1 1.62127122246393    1.5    2     6 0.628131971986058
# Distribution of county
tbl <- table(rent_cleaned$county)
tbl_df <- as.data.frame(tbl)
colnames(tbl_df) <- c("County", "Frequency")
tbl_df
##           County Frequency
## 1        alameda     24659
## 2   contra costa     13087
## 3          marin     10447
## 4           napa      2073
## 5  san francisco     52914
## 6      san mateo     24165
## 7    santa clara     39097
## 8     santa cruz      8339
## 9         solano      3674
## 10        sonoma     11982

Data Visualisations

Let us now proceed to visualise the data. To answer the question: How do different factors affect the rental prices of apartments in the Bay Area?, we can look into different factors like geography, yearly trends and the properties of the rental properties. The discussion of the insights from the graphs is included at the end since there are some overlapping insights.

Visualisation 1: Geographical Analysis of Rental Prices in the Bay Area

To determine the influence of county on rental prices, we create a map of median prices in the year 2018. Data from Santa Cruz was omitted simply because of its significant geographical distance away from the other counties in the data set. This is an interactive plot. Feel free to hover over the counties for their specific values of median rental prices.

us_counties = map_data("county")

data_2018 <- rent_cleaned %>% 
  filter(year == 2018) %>% 
  group_by(county) %>% 
  summarize(median_price = median(price))

merged <- data_2018 %>% 
  left_join(us_counties, by = c("county" = "subregion")) %>% 
  filter(county != "santa cruz") %>% 
  mutate(county = toTitleCase(county))

label <- merged %>% 
  group_by(county) %>% 
  summarize(mean_lat = mean(lat),
            mean_long = mean(long)) %>% 
  mutate(county = recode(county, "San Francisco" = "SF"))

gg_plot <- ggplot(data = merged) + 
  geom_polygon(aes(x = long, y = lat, group = group, fill = median_price/1000, text = paste("County: ", county, "<br>Median Price: $", 
                                format(median_price))), color = "#FFFFFF") + 
  scale_fill_continuous(name = "Median Price", low = "#8EC4FE", high = "#31418F", labels = function(x) paste0("$", x, "k")) +   
  geom_text(data = label, aes(x = mean_long, y = mean_lat, label = county), color = "white", size = 3) +
  theme_minimal() + 
  theme(legend.position = "bottom", 
        axis.title = element_blank(), 
        axis.text = element_blank(), 
        axis.ticks = element_blank()) +
  labs(title = "Geographical Analysis of Rental Prices in the Bay Area in 2018")
## Warning in geom_polygon(aes(x = long, y = lat, group = group, fill =
## median_price/1000, : Ignoring unknown aesthetics: text
ggplotly(gg_plot, tooltip = "text")

Visualisation 2: Yearly Trend of Rental Prices in Different Counties

To determine the influence of both county and year on rental prices in the Bay Area, we plot mean price (grouped by county and year) across 2000 to 2018, split by county. Line graphs are especially suitable for time-series visualisations to capture the temporal evolution of rental prices. We used a distinct line for each county, using high contrast colours, so that we can uncover how prices differ between different counties in the Bay Area. This is an interactive line graph. Using the “Compare data on hover” button in the top right of the plot, you will be able to compare prices in each year across county.

tmp <- rent_cleaned %>%
  group_by(year, county) %>%
  summarize(mean_price = mean(price, na.rm = TRUE), .groups = "drop") %>%
  filter(!is.na(county))

countynames <- tmp %>%
  filter(year == 2018)

color <- c("#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd", "#8c564b", "#e377c2", "#7f7f7f", "#bcbd22", "#17becf")

p <- ggplot(data = tmp) +
  geom_line(aes(x = year, y = mean_price, col = county), lwd = 1, show.legend = FALSE) +
  ylim(c(0, 4000)) +
  xlim(c(2000, 2021)) +
  geom_text(data = countynames, aes(x = year, y = mean_price, label = county, col = county), nudge_x = 0.5, nudge_y = c(-50, 5, 20, 30, 0, 100, 10, -30, -50, 50),size = 3.5, show.legend = FALSE) +
  labs(title = "Mean rental price of property in each county from 2000 to 2018") +
  scale_color_manual(values = color) +
  ylab("Mean price") +
  xlab("Year") +
  theme_classic() +
  theme(panel.grid = element_blank(), # Remove gridlines
        axis.title.y = element_text(margin = margin(t = 0, r = 15, b = 0, l = 0)),
        axis.title.x = element_text(margin = margin(t = 15, r = 0, b = 0, l = 0)),
        plot.title = element_text(face = "bold"), # Bold title
        plot.subtitle = element_text(margin = margin(t = 0, r = 0, b = 15, l = 0)),
        axis.title = element_text(face = "bold"))  # Bold axis label

p <- ggplotly(p)
p <- p %>% 
  layout(showlegend = FALSE,
         hoverlabel = list(font = list(size = 10))) %>% 
  style(textposition = 'middle right')
p

Visualisation 3: Relationship between Size of Living Area and Rental Price

The size of living area is a factor that might affect how much a renter is willing to pay for the flat, and how much the landlord would charge for it. We want to verify this assumption and thus plot a scatterplot of rental price against sqft of the apartment. Scatterplots are especially useful for identifying relationships between numeric variables. To further aid in this, we can also plot the best fit line, together with the slope, intercept and correlation coefficient, so we can better understand any linear relationships. Before plotting the scatterplot, we remove outlier values, simply defined as values smaller than the 1st quantile and larger than the 99th quantile. This is to make the graph neater by removing the few extreme values. Finally, the data was colour coded by county to provide more insight about the trend across different counties.

# Remove any abnormally large or small sqft values which are likely data errors
sqft_trend <- rent_cleaned %>%
  mutate(sqft = case_when(sqft <= quantile(sqft, 0.01, na.rm = TRUE) ~ NA,
                          sqft >= quantile(sqft, 0.99, na.rm = TRUE) ~ NA,
                          TRUE ~ sqft)) %>%
  select(price, sqft, county) %>%
  na.omit()

cor_coefficient <- cor(sqft_trend$sqft, sqft_trend$price, use = "complete.obs")

# Fit a linear model to the data
lm_model <- lm(price ~ sqft, data = sqft_trend)
# Extract the coefficients of the best-fit line
intercept <- coef(lm_model)[1]
slope <- coef(lm_model)[2]

ggplot(sqft_trend, aes(sqft, price)) +
  geom_point(aes(color = county), position = "dodge") +
  geom_smooth(method="lm", se = TRUE, color = "black", linetype = "longdash") +
  annotate("text", x = 50, y = max(sqft_trend$price), # Adjust x and y positions
           label = paste("Slope (b) =", round(slope, 2)),
           hjust = "left", vjust = "top", color = "black", size = 3) +
  annotate("text", x = 50, y = max(sqft_trend$price) - 500,
           label = paste("Intercept (a) =", round(intercept, 2)),
           hjust = "left", vjust = "top", color = "black", size = 3) +
  annotate("text", x = 50, y = max(sqft_trend$price) - 1000,
           label = paste("Correlation (r) =", round(cor_coefficient, 2)),
           hjust = "left", vjust = "top", color = "black", size = 3) +
  theme(plot.title = element_text(size = 16, hjust = 0, margin = margin(t = 15)),
        plot.subtitle = element_text(margin = margin(t = 0, r = 0, b = 10, l = 0)),
        axis.title = element_text(size = 14),
        axis.text = element_text(size = 12),
        axis.title.x = element_text(size = 14, margin = margin(t = 10)), # Add margin to x-axis title
        axis.title.y = element_text(size = 14, margin = margin(r = 10)), # Add margin to y-axis title
        plot.caption = element_text(size = 12),
        legend.position = "bottom") +
  scale_x_continuous(limits = c(0, 2900), expand = c(0, 0)) +  # Ensure x-axis starts from 0
  labs(title = "Effect of Size of Living Area on Rental Price",
       subtitle = "Relationship between square feet of the property and rental price",
       x = "Sqft",
       y = "Price ($)") 

Visualisation 4: Trend of prices across seasons

We plot a heatmap of prices across the months so that we can understand how prices change with season (represented by the month). In Visualisation 2, we saw that there is relationship between sqft and price. Thus, for a fairer comparison, in Visualisation 3, we computed the rental price per sqft and averaged it across the years (2000 - 2018), grouped by county and month. A sequential palette is used whereby the lighter shades indicate lower prices while the darker shades indicate higher prices, such that the color intensity in the heatmap serves as a visual indicator of the magnitude of rental prices per sqft.

data_average_monthly <- rent_cleaned %>%
  group_by(month, county) %>%
  summarise(price = mean(price/sqft, na.rm = TRUE))

ggplot(data_average_monthly, aes(x = factor(month), y = county, fill = price)) +   
  geom_tile() +   
  geom_text(aes(label = sprintf("%.3g", price)), color = "black", size = 3) +   
  labs(title = "Seasonal Changes of Rental Price Per Sqft in the Bay Area",
       subtitle = "Monthly Rental Prices Per Sqft Averaged Across 2000 to 2018",         
       y = "County", x = "Month", fill = "Mean Price Per Sqft") +   
  scale_x_discrete(breaks = 1:12, labels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")) +   
  scale_fill_gradient(low = "#ffff80", high = "#cc0000",                        
                      labels = function(x) sprintf("$%.3g", x)) +   
  theme_minimal() +   
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  theme(panel.grid = element_blank(), # Remove gridlines
        axis.title.y = element_text(margin = margin(t = 0, r = 15, b = 0, l = 0)),
        axis.title.x = element_text(margin = margin(t = 15, r = 0, b = 0, l = 0)),
        plot.title = element_text(face = "bold"), # Bold title
        plot.subtitle = element_text(margin = margin(t = 0, r = 0, b = 10, l = 0)),
        axis.title = element_text(face = "bold"))  # Bold axis label

Insights

Visualisation 1 provdes insights about the geographical influence on rental prices. Notably, prices are highest in San Francisco and lowest in Solano. From Visualisation 2, across the years, there is a general increasing trend of rental price of property as seen from the approximately positive slope from the line graphs, likely due to the effects of inflation and worsened housing shortages. However, we observe that there seems to be a dip in rental prices in the late 2000s (2008 to 2010) in all the counties. This was likely due to the fact that the housing market experienced a significant downturn during the 2008 Global Financial Crisis and when the housing market bubble burst, home values plummeted from its previously inflated prices (Boyle, 2023). In Visualisation 3, there is an upward sloping best fit line with a gradient of 0.95, indicating a positive linear association between price and the size of living space. The moderate correlation coefficient of r = 0.42 further supports the meaningful relationship between these two variables, which is logical given that larger flats typically grant higher prices. Finally, from the heatmap in Visualisation 4, we observe a dip in price per sqft in the month of July across all the counties as seen from the lighter colour of the Jul column. In the US, July marks the beginning of Summer – For students, this is the end of the academic year when many choose to return home for the Summer, and for others, this is a common period for vacations, which may lead to a temporary surplus of available rental units. In all the graphs, we observe that counties like San Francisco followed by Marin have generally had the highest prices across the years compared to the other counties. Possible reasons for the higher prices in San Francisco and Marin may include these counties historically having limited land area for development and stricter zoning regulations, leading to high demand and limited supply. Additionally, these counties also have closer proximity to job centers and economic hubs, in which San Francisco rental prices are being pushed up by the “booming tech industry and proximity to Silicon Valley” (Towey, 2021). In contrast, rental prices in Solano are lower for similar-sized living spaces, being one of the northernmost Bay Area counties (Towey, 2021).

In conclusion, the visualizations offer an understanding of how different factors, including the year, season, geographical area and size of living area, Together, these insights highlight the complexity of the Bay Area rental market, where besides the living conditions of the apartment (e.g. size of living area), a combination of economic, geographic, seasonal and societal factors contributes to the observed variations in rental prices over time and across counties. Nevertheless, a major limitation of this study is the lacking geographical analysis of the Bay Area Rental Market to understand why counties like San Francisco and Marin are more desirable for renters while counties like Solano are less so.

References

Boyle, M. J. (2023, April 30). 2008 recession: What it was and what caused it. Investopedia. https://www.investopedia.com/terms/g/great-recession.asp

Pennington, K. (2018). Bay Area Craigslist Rental Housing Posts, 2000-2018. Retrieved from https://github.com/katepennington/historic_bay_area_craigslist_housing_posts/blob/master/clean_2000_2018.csv.zip.

Pennington, K. (2018, August 30). Craigslist scrape methodology. https://www.katepennington.org/clmethod

R4DS Online Learning Community (2023). Tidy Tuesday: A weekly social data project. https://github.com/rfordatascience/tidytuesday.

SF.Gov. (2016, November 16). Building Permits [Data set]. https://data.sfgov.org/Housing-and-Buildings/Building-Permits/i98e-djp9

Towey, H. (2021, July 15). Bay area renters have to make between $31 and $68 an hour in order to afford an apartment. Business Insider. https://www.businessinsider.com/san-francisco-bay-area-rent-least-affordable-housing-in-us-2021-7